In Boston, the administrative hierarchy can be represented as follows:
1. Country: United States
- The highest level of administrative division, encompassing the entire country.
2. State: Massachusetts
- The state in which Boston is located.
3. County: Suffolk County
- The county in which Boston is located. Suffolk County includes the city of Boston and some neighboring areas.
4. City: Boston
- The city of Boston itself, which is the capital and largest city of Massachusetts.**
5. Neighborhoods/Districts: Boston is further divided into several neighborhoods or districts, each with its own characteristics and local governance. Some of the well-known neighborhoods in Boston include:
A1: Downtown
A15: Charlestown
A7: East Boston
B2: Roxbury
B3: Mattapan
C6: South Boston
C11: Dorchester
D4: South End
D14: Brighton
E5: West Roxbury
E13: Jamaica Plain
E18: Hyde Park
These administrative divisions outline the hierarchical structure of Boston's governance and provide a framework for managing and providing services to different areas within the city.
The Crimes in Boston dataset capturing the type of incident as well as when and where it occurred, a well-known dataset in the field of machine learning and data analysis, contains measurements of different fields of crime given by BPD. This project aims to perform an in-depth exploratory data (EDA) and statistical analysis of the crime dataset to gain insights into the characteristics of these incidents that are provided by Boston Police Department to initial details surrounding an incident to which BPD officers respond and capturing the type of incident as well as when and where it occurred in and make meaningful conclusions.
Related of data collection of stated problem:
Create project plan and product backlog
Objective:
To perform an in-depth of the Crimes in Boston dataset, you can follow these steps:
1. Define project objectives:
2. Data collection and loading:
3. Exploratory data analysis:
4. Statistical analysis:
5. Hypothesis Testing:
6. Data preparation and cleaning:
7. Model Development:
8. Model Validation:
9. Documentation:
10. Deployment and Integration:
11. Testing and Quality Assurance:
12. Maintenance and Monitoring:
13. Report writing:
14. conclusion:
Create Git Repository
My Git Repository
Phase-2 (Summary)
Statistical Analysis
Data Exploration and analysis for the stated problem & Given Dataset (Coding)
import pandas as pd
df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
df.head(5)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\3190502872.py:1: DtypeWarning: Columns (0,2,3,4,5,6,7,10,12,13,16,17,19) have mixed types. Specify dtype option on import or set low_memory=False.
df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
| INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | 2403.0 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | NaN | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | 3201.0 | Property Lost | PROPERTY - LOST | D14 | 795 | NaN | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
| 2 | I182080052 | 2647.0 | Other | THREATS TO DO BODILY HARM | B2 | 329 | NaN | 03-10-2018 19.20 | 2018.0 | 10.0 | Wednesday | 19.0 | Part Two | DEVON ST | 42.308126 | -71.076930 | (42.30812619, -71.07692974) | 03-10-2018 | 24.0 | female |
| 3 | I182080051 | 413.0 | Aggravated Assault | ASSAULT - AGGRAVATED - BATTERY | A1 | 92 | NaN | 03-10-2018 20.00 | 2018.0 | 10.0 | Wednesday | 20.0 | Part One | CAMBRIDGE ST | 42.359454 | -71.059648 | (42.35945371, -71.05964817) | 03-10-2018 | 56.0 | female |
| 4 | I182080050 | 3122.0 | Aircraft | AIRCRAFT INCIDENTS | A7 | 36 | NaN | 03-10-2018 20.49 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Three | PRESCOTT ST | 42.375258 | -71.024663 | (42.37525782, -71.02466343) | 03-10-2018 | 57.0 | male |
import pandas as pd
import numpy as np
# Calculate the total number of crimes that occurred in each neighborhood
neighborhood_crime_counts = df['DISTRICT'].value_counts()
# Calculate the probability of a crime occurring in a given neighborhood on a given day
neighborhood_crime_probabilities = neighborhood_crime_counts / df.shape[0]
# Print the probability of a crime occurring in the neighborhood with the highest crime rate
print(neighborhood_crime_probabilities.max())
0.09758455025448319
# Calculate the total number of crimes of each type
crime_type_counts = df['OFFENSE_CODE_GROUP'].value_counts()
# Calculate the probability of a certain type of crime being committed
crime_type_probabilities = crime_type_counts / df.shape[0]
# Print the probability of the most common type of crime being committed
print(crime_type_probabilities.max())
0.07255672358845075
# Calculate the number of crimes that were solved
df1 = df[df['SHOOTING'] == True].shape[0]
# Calculate the probability of a crime being solved
crime_shooting_probability = df1 / df.shape[0]
# Print the probability of a crime being solved
print(crime_shooting_probability)
0.0
# Convert to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])
# Create new column for day of week
df['DAY_OF_WEEK'] = df['DATE'].dt.day_name()
# Calculate total number of crimes
total_crimes = len(df)
# Calculate total number of crimes that occurred on weekdays
weekday_crimes = len(df[df['DAY_OF_WEEK'].isin(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])])
# Calculate probability of crimes occurring on weekdays
prob_weekday_crime = weekday_crimes / total_crimes
print(f"Probability of crimes occurring on weekdays: {prob_weekday_crime:.2f}")
# Convert to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])
# Create new column for year
df['YEAR'] = df['DATE'].dt.year
# Calculate total number of crimes
total_crimes = len(df)
# Calculate total number of crimes that occurred in each year
crimes_by_year = df.groupby('YEAR').size().reset_index(name='COUNT')
# Calculate probability of crimes occurring in each year
crimes_by_year['PROBABILITY'] = crimes_by_year['COUNT'] / total_crimes
print(crimes_by_year[['YEAR', 'PROBABILITY']])
# Convert the 'OCCURRED_ON_DATE' column to datetime format
df['DATE'] = pd.to_datetime(df['DATE'])
# Filter out crimes that occurred on weekends
df_weekend = df[df['DATE'].dt.dayofweek.isin([5, 6])]
# Calculate the probability of a crime being committed on a weekend
prob_weekend = len(df_weekend) / len(df)
print(f'The probability of a crime being committed on a weekend in Boston is {prob_weekend:.2%}.')
# Calculate the average crime hour in Boston
average_crime_hour = df['HOUR'].mean()
# Print the average crime hour in Boston
print(average_crime_hour)
13.114840461228724
# Calculate the median crime hour in Boston
median_crime_rate = df['HOUR'].median()
# Print the median crime hour in Boston
print(median_crime_rate)
14.0
# Group the data by district and count the number of offenses
offenses_by_district = df.groupby('DISTRICT')['INCIDENT_NUMBER'].count()
# Find the district with the highest number of offenses
district_with_most_offenses = offenses_by_district.idxmax()
print(f"The district with the highest number of offenses is {district_with_most_offenses}")
The district with the highest number of offenses is B2
# Count the number of occurrences of each crime type
crime_counts = df['OFFENSE_DESCRIPTION'].value_counts()
# Select the least common type of crime in Boston
least_common_crime = crime_counts.index[-1]
print(f'The least common type of crime in Boston is {least_common_crime}.')
The least common type of crime in Boston is DRUGS - POSS CLASS D - INTENT MFR DIST DISP.
# Count the number of occurrences of each crime type
crime_counts = df['OFFENSE_DESCRIPTION'].value_counts()
# Select the least common type of crime in Boston
most_common_crime = crime_counts.index[1]
print(f'The most common type of crime in Boston is {most_common_crime}.')
The most common type of crime in Boston is INVESTIGATE PERSON.
df.describe()
| OFFENSE_CODE | YEAR | MONTH | HOUR | Lat | Long | AGE | |
|---|---|---|---|---|---|---|---|
| count | 327820.000000 | 327820.000000 | 327820.000000 | 327820.000000 | 307188.000000 | 307188.000000 | 56911.000000 |
| mean | 2317.961171 | 2016.598676 | 6.672213 | 13.114840 | 42.212995 | -70.906030 | 39.059338 |
| std | 1184.990073 | 1.009775 | 3.253984 | 6.292714 | 2.173496 | 3.515832 | 12.400959 |
| min | 111.000000 | 2015.000000 | 1.000000 | 0.000000 | -1.000000 | -71.178674 | 18.000000 |
| 25% | 1001.000000 | 2016.000000 | 4.000000 | 9.000000 | 42.297466 | -71.097081 | 28.000000 |
| 50% | 2907.000000 | 2017.000000 | 7.000000 | 14.000000 | 42.325552 | -71.077493 | 39.000000 |
| 75% | 3201.000000 | 2017.000000 | 9.000000 | 18.000000 | 42.348624 | -71.062482 | 50.000000 |
| max | 3831.000000 | 2018.000000 | 12.000000 | 23.000000 | 42.395042 | -1.000000 | 86.000000 |
Dispersion of the data used to understands the distribution of data. It helps to understand the variation of data and provides a piece of information about the distribution data. Range, IQR, variance and standard deviations are the methods used to understand the distribution data.
# Calculate the standard deviation of the crime hour in Boston
crime_rate_standard_deviation = df['HOUR'].std()
# Print the standard deviation of the crime hour in Boston
print(crime_rate_standard_deviation)
6.292714255219991
# Calculate the variance of the 'Hour' column
variance = np.var(df['AGE'])
print(f'The variance of the criminal Age in Boston is {variance:.2f}.')
The variance of the criminal Age in Boston is 153.78.
print("Range of latitude is :",df['Lat'].max()-df['Lat'].min())
print("Range of longitude is :",df['Long'].max()-df['Long'].min())
Range of latitude is : 43.39504158 Range of longitude is : 70.17867378
# Calculate the IQR of the 'Hour' column
Q1 = df['AGE'].quantile(0.25)
Q3 = df['AGE'].quantile(0.75)
IQR = Q3 - Q1
print(f'The interquartile range (IQR) of the criminal age in Boston is {IQR:.2f}.')
The interquartile range (IQR) of the criminal age in Boston is 22.00.
Researchers that collect data during studies often find themselves with large sets of data that they need to simplify in order for them to communicate their findings to different audiences. To do this, they often use what is called a data distribution. A data distribution is a graphical representation of data that was collected from a sample or population. It is used to organize and disseminate large amounts of information in a way that is meaningful and simple for audiences to digest.
There are two types of data distribution based on two different kinds of data: Discrete and Continuous. Discrete data distributions include binomial distributions, Poisson distributions, and geometric distributions. Continuous data distributions include normal distributions and the Student's t-distribution.
A probability plot is used to determine the distribution of data. It is a test that graphs data points along a straight line. Data that fit along that line qualify as that given type of distribution.
%matplotlib inline
import matplotlib.pyplot as plt
df.plot(kind='hist')
<Axes: ylabel='Frequency'>
# Create a histogram plot for the crime data
plt.hist(df['OFFENSE_CODE'], bins=50)
plt.xlabel('Offense Code')
plt.ylabel('Frequency')
plt.title('Distribution of Crime Data in Boston')
plt.show()
# Convert the 'Date' column to a datetime object
df['DATE'] = pd.to_datetime(df['DATE'])
# Group the data by month and count the number of crimes
monthly_crime_counts = df.groupby(pd.Grouper(key='DATE', freq='M')).size()
# Create a scatter plot of the monthly crime counts
plt.scatter(monthly_crime_counts.index, monthly_crime_counts.values)
# Set the title and axis labels
plt.title('Monthly Crime Counts in Boston')
plt.xlabel('Month')
plt.ylabel('Number of Crimes')
# Display the plot
plt.show()
import seaborn as sns
sns.scatterplot(data=df)
import seaborn as sns
sns.distplot(df['OFFENSE_CODE'], kde = False, color ='red', bins = 30)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1320495804.py:2: UserWarning: `distplot` is a deprecated function and will be removed in seaborn v0.14.0. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). For a guide to updating your code to use the new functions, please see https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751 sns.distplot(df['OFFENSE_CODE'], kde = False, color ='red', bins = 30)
<Axes: xlabel='OFFENSE_CODE'>
A test statistic describes how closely the distribution of your data matches the distribution predicted under the null hypothesis of the statistical test you are using.
The distribution of data is how often each observation occurs, and can be described by its central tendency and variation around that central tendency. Different statistical tests predict different types of distributions, so it’s important to choose the right statistical test for your hypothesis.
The test statistic summarizes your observed data into a single number using the central tendency, variation, sample size, and number of predictor variables in your statistical model.
| Test Statistic | Formula | Finding |
|---|---|---|
| T-value for 1-sample t-test | Take the sample mean, subtract the hypothesized mean, and divide by the standard error of the mean. | |
| T-value for 2-sample t-test | Take one sample mean, subtract the other, and divide by the pooled standard deviation. | |
| F-value for F-tests and ANOVA | Calculate the ratio of two variances. | |
| Chi-squared value (χ2) for a Chi-squared test | Sum the squared differences between observed and expected values divided by the expected values. |
A t-test is a statistical hypothesis test that is used to determine whether there is a significant difference between the means of two groups. It helps you assess whether any observed differences between the groups are likely to have occurred by chance or if they are statistically significant.
Types of T-tests (With Solved Examples in R)
There are three types of t-tests we can perform based on the data at hand:
T test formula : The formula for the two-sample t test (a.k.a. the Student’s t-test) is shown below.
from scipy.stats import ttest_ind
# Split the dataset into two groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']
# Perform the t-test
t, p = ttest_ind(group1['OFFENSE_CODE'], group2['OFFENSE_CODE'])
# Print the results
print('t-value:', t)
print('p-value:', p)
t-value: 0.539141126234205 p-value: 0.5897911489474033
A Z-test is a statistical hypothesis test that is used to determine whether there is a significant difference between the sample mean and a known population mean when the population standard deviation is known. It is particularly useful when dealing with large sample sizes and normally distributed data. Z-tests are a parametric test, which means they make certain assumptions about the data, such as normality and known population standard deviation.
The formula for a Z-test statistic is:
from scipy.stats import norm
# Split the dataset into two groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']
# Calculate the mean and standard deviation of each group
mean1, std1 = group1['OFFENSE_CODE'].mean(), group1['OFFENSE_CODE'].std()
mean2, std2 = group2['OFFENSE_CODE'].mean(), group2['OFFENSE_CODE'].std()
# Calculate the standard error of the difference between means
se = ((std1 ** 2) / len(group1) + (std2 ** 2) / len(group2)) ** 0.5
z = (mean1 - mean2) / se
# Calculate the p-value
p = 1 - norm.cdf(abs(z))
# Print the results
print('z-value:', z)
print('p-value:', p)
z-value: 0.5323332640349332 p-value: 0.2972475987305382
ANOVA stands for Analysis of Variance, and it is a statistical technique used to analyze the differences among group means in a sample. ANOVA is especially useful when you want to compare the means of three or more groups or treatments to determine if there are significant differences between them. It helps in assessing whether the variation between group means is greater than what would be expected by random chance.
Formula for One-way ANOVA:
The formula for one-way ANOVA involves calculating the F-statistic, which follows an F-distribution. Here's the basic formula:
from scipy.stats import f_oneway
# Split the dataset into three groups
group1 = df[df['DISTRICT'] == 'D14']
group2 = df[df['DISTRICT'] == 'B2']
group3 = df[df['DISTRICT'] == 'A1']
# Perform the F-test
f, p = f_oneway(group1['OFFENSE_CODE'], group2['OFFENSE_CODE'], group3['OFFENSE_CODE'])
# Print the results
print('F-value:', f)
print('p-value:', p)
F-value: 482.9052727105152 p-value: 1.5978452466819635e-209
Chi-Square, often denoted as χ² (chi-squared), is a statistical test used to determine if there is a significant association or relationship between two categorical variables in a contingency table. It is a non-parametric test, meaning it does not rely on any assumptions about the distribution of the data, making it suitable for categorical data analysis.
The Chi-Square test formula is used to calculate the Chi-Square statistic, which quantifies the difference between the observed and expected frequencies in a contingency table. Here's the formula for the Chi-Square statistic in the context of a 2x2 contingency table:
from scipy.stats import chi2_contingency
# Create a contingency table for crime rates by neighborhood
crime_table = df.groupby('DISTRICT')['OFFENSE_CODE'].value_counts()
# Perform the Chi-Square test
chi2_statistic, p_value, degrees_of_freedom, expected_counts = chi2_contingency(crime_table)
# Print the results
print('Chi-Square statistic:', chi2_statistic)
print('p-value:', p_value)
print('Degrees of freedom:', degrees_of_freedom)
print('Expected counts:', expected_counts)
Chi-Square statistic: 0.0 p-value: 1.0 Degrees of freedom: 0 Expected counts: [2.220e+03 1.917e+03 1.912e+03 ... 1.000e+00 1.000e+00 1.000e+00]
Principal Component Analysis (PCA) is a widely used technique in statistics and data science for dimensionality reduction and data visualization. Its importance lies in several key applications and benefits:
Dimensionality Reduction: PCA is primarily used for reducing the dimensionality of large datasets while retaining as much of the original variability as possible. By transforming the data into a new set of variables (principal components), it eliminates redundant or less important information, making complex data more manageable and easier to analyze.
Data Visualization: PCA is a powerful tool for visualizing data in lower-dimensional spaces. It helps to project high-dimensional data onto a lower-dimensional subspace, making it possible to represent data in two or three dimensions, which can be easily visualized in scatter plots or other graphical forms.
Noise Reduction: In many datasets, there is noise or irrelevant information that can make analysis challenging. PCA can help remove this noise by focusing on the principal components that capture the most significant variation in the data.
Pattern Recognition and Clustering: PCA can be used as a preprocessing step for pattern recognition and clustering algorithms. It can help improve the performance of these techniques by reducing the feature space while retaining essential information.
Feature Engineering: In machine learning, feature engineering is a crucial step in model development. PCA can be used to create new features or reduce the dimensionality of feature sets, leading to more efficient and accurate models.
Multicollinearity Mitigation: In regression analysis, multicollinearity (high correlation among predictor variables) can lead to unstable coefficient estimates. PCA can address this issue by transforming the correlated predictors into orthogonal (uncorrelated) principal components.
Anomaly Detection: PCA can be used for anomaly or outlier detection by examining data points that deviate significantly from the expected pattern in the lower-dimensional subspace.
Compression: In data storage and transmission, PCA can be used to compress data while retaining critical information. This is particularly useful in scenarios where storage or bandwidth is limited.
Eigenvector and Eigenvalue Analysis: PCA is built on the mathematical concepts of eigenvectors and eigenvalues, which have applications beyond PCA, including physics, engineering, and quantum mechanics.
Interpretability: PCA often leads to more interpretable and understandable representations of data. The principal components can be analyzed to understand which original variables or features contribute the most to the variance.
Machine Learning: PCA can be integrated into machine learning pipelines as a preprocessing step to improve model performance, reduce overfitting, and speed up training.
# Impute missing values in the age column
df['OFFENSE_CODE'] = df['OFFENSE_CODE'].fillna(df['OFFENSE_CODE'].mean())
df.head()
| INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | 2403.0 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | NaN | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | 3201.0 | Property Lost | PROPERTY - LOST | D14 | 795 | NaN | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
| 2 | I182080052 | 2647.0 | Other | THREATS TO DO BODILY HARM | B2 | 329 | NaN | 03-10-2018 19.20 | 2018.0 | 10.0 | Wednesday | 19.0 | Part Two | DEVON ST | 42.308126 | -71.076930 | (42.30812619, -71.07692974) | 03-10-2018 | 24.0 | female |
| 3 | I182080051 | 413.0 | Aggravated Assault | ASSAULT - AGGRAVATED - BATTERY | A1 | 92 | NaN | 03-10-2018 20.00 | 2018.0 | 10.0 | Wednesday | 20.0 | Part One | CAMBRIDGE ST | 42.359454 | -71.059648 | (42.35945371, -71.05964817) | 03-10-2018 | 56.0 | female |
| 4 | I182080050 | 3122.0 | Aircraft | AIRCRAFT INCIDENTS | A7 | 36 | NaN | 03-10-2018 20.49 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Three | PRESCOTT ST | 42.375258 | -71.024663 | (42.37525782, -71.02466343) | 03-10-2018 | 57.0 | male |
# Impute missing values in the age column
df['AGE'] = df['AGE'].fillna(df['AGE'].mean())
df.head()
| INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | 2403.0 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | NaN | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | 3201.0 | Property Lost | PROPERTY - LOST | D14 | 795 | NaN | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
| 2 | I182080052 | 2647.0 | Other | THREATS TO DO BODILY HARM | B2 | 329 | NaN | 03-10-2018 19.20 | 2018.0 | 10.0 | Wednesday | 19.0 | Part Two | DEVON ST | 42.308126 | -71.076930 | (42.30812619, -71.07692974) | 03-10-2018 | 24.0 | female |
| 3 | I182080051 | 413.0 | Aggravated Assault | ASSAULT - AGGRAVATED - BATTERY | A1 | 92 | NaN | 03-10-2018 20.00 | 2018.0 | 10.0 | Wednesday | 20.0 | Part One | CAMBRIDGE ST | 42.359454 | -71.059648 | (42.35945371, -71.05964817) | 03-10-2018 | 56.0 | female |
| 4 | I182080050 | 3122.0 | Aircraft | AIRCRAFT INCIDENTS | A7 | 36 | NaN | 03-10-2018 20.49 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Three | PRESCOTT ST | 42.375258 | -71.024663 | (42.37525782, -71.02466343) | 03-10-2018 | 57.0 | male |
from sklearn.decomposition import PCA
import numpy as np
n=['OFFENSE_CODE','AGE']
data_stats=df[n]
pca = PCA(n_components=0.9)
principal_components = pca.fit_transform(data_stats)
num_components = pca.n_components_
num_components
1
!pip install plotly
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import plotly.express as px
import scipy.stats as stat
import seaborn as sns
import pandas as pd
import numpy as np
import calendar
import json
crime = df['OFFENSE_CODE_GROUP'].value_counts().index[:10]
crime_count = df['OFFENSE_CODE_GROUP'].value_counts().values[:10]
plt.figure(figsize=(12,8))
ax = sns.barplot(y = crime , x = crime_count, orient='h', palette='Reds_r')
plt.xlabel(xlabel='OFFENSE_NAME')
plt.ylabel(ylabel='OFFENSE_GROUP')
plt.title("OFFENSE_DESCRIPTION", fontdict = {'size' : 'xx-large', 'fontweight' : 'bold'})
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\248667759.py:5: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. ax = sns.barplot(y = crime , x = crime_count, orient='h', palette='Reds_r')
labels = df['UCR_PART'].astype('category').cat.categories.tolist()
counts = df['UCR_PART'].value_counts()
sizes = [counts[var_cat] for var_cat in labels]
fig1, ax1 = plt.subplots(figsize = (22,12))
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=140, textprops={'color':"black", 'size' : 'x-large', 'fontweight' : 'bold'})
ax1.axis('equal')
plt.show()
order = df['OFFENSE_CODE_GROUP'].value_counts().head(15).index
plt.figure(figsize = (30,10))
sns.countplot(df, x='OFFENSE_CODE_GROUP',hue=df.DISTRICT, order = order ,palette="cubehelix");
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks(rotation=90)
plt.show()
df['YEAR'] = df['DATE'].str[-4:]
yeared=df.groupby("YEAR").size()
yeared.plot(kind="line",color="green",linewidth=4)
plt.title("Crime in Boston by Year")
plt.ylabel("Number of Crimes")
plt.show()
fig = px.histogram(df, x=['YEAR'], template='plotly_white',
opacity=0.7,log_y=True, labels={'x':'YEARS', 'y':'Case Number Count'} )
fig.update_layout(coloraxis=dict(colorscale='Bluered_r'), showlegend=False)
fig.show()
plt.figure(figsize=(12, 6))
sns.barplot(x='UCR_PART', y='DISTRICT', data=df, palette='Set1')
plt.title('UCR_PART')
plt.xlabel('DISTRICT')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\698744356.py:2: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
fig, ax =plt.subplots(1,4, figsize=(30, 6), sharey=False)
sns.countplot(df['YEAR'], ax=ax[0])
sns.countplot(df['MONTH'], ax=ax[1])
sns.countplot(df['DAY_OF_WEEK'], ax=ax[2])
sns.countplot(df['HOUR'], ax=ax[3])
fig.show()
sns.pairplot(df)
import plotly.express as pt
import pandas as pd
fig = pt.violin(df, y="YEAR")
fig.show()
fig, ax = plt.subplots(figsize=(10, 6))
week_and_hour = df.groupby(['HOUR', 'DAY_OF_WEEK']).count()['OFFENSE_CODE_GROUP'].unstack()
week_and_hour.columns = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap = sns.heatmap(week_and_hour, cmap=sns.cubehelix_palette(as_cmap=True), ax=ax)
heatmap.set_yticklabels(heatmap.get_yticklabels(), rotation=0)
plt.xlabel('')
plt.ylabel('Hour')
plt.show()
sns.catplot(y='OFFENSE_CODE_GROUP',
kind='count',
height=11,
aspect=2,
order=df.OFFENSE_CODE_GROUP.value_counts().index,
data=df)
<seaborn.axisgrid.FacetGrid at 0x1e3dfdbeb60>
import matplotlib as mpl
df.Lat.replace(-1, None, inplace=True)
df.Long.replace(-1, None, inplace=True)
mpl.rcParams["figure.figsize"] = 21,11
plt.subplots(figsize=(11,6))
sns.scatterplot(x='Lat', y='Long', alpha=0.1, data=df)
plt.legend(loc=2)
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
<matplotlib.legend.Legend at 0x1e3fc0edba0>
order = df['OFFENSE_CODE_GROUP'].value_counts().head(15).index
g = sns.FacetGrid(data = df, hue = "MONTH", height = 5)
g.map(sns.kdeplot, "OFFENSE_CODE", shade = True)
g.add_legend()
plt.figure(figsize = (30,10))
sns.countplot(data = df, x='OFFENSE_CODE_GROUP', order = order ,palette="cubehelix");
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xticks(rotation=90)
plt.show()
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1189323747.py:6: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect. No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
import seaborn as sns
import pandas as pd
# Define the data
data = pd.DataFrame({'Group': ['A', 'B', 'C', 'D'] * 3,
'MONTH': ['Jan', 'Feb', 'Mar'] * 4,
'Value': [1, 3, 2, 5, 6, 8, 9, 12, 11, 14, 13, 15]})
# Create the FacetGrid object
g = sns.FacetGrid(data=data, hue="MONTH", height=5)
# Plot the kdeplot
g.map(sns.kdeplot, "Value", shade=True).add_legend();
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code. C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\seaborn\axisgrid.py:854: FutureWarning: `shade` is now deprecated in favor of `fill`; setting `fill=True`. This will become an error in seaborn v0.14.0; please update your code.
!pip install folium
import folium
from folium.plugins import HeatMap
B2_district=df.loc[df.DISTRICT=='B2'][['Lat','Long']]
B2_district.Lat.fillna(0, inplace = True)
B2_district.Long.fillna(0, inplace = True)
map_1=folium.Map(location=[42.356145,-71.064083],
tiles = "OpenStreetMap",
zoom_start=11)
folium.CircleMarker([42.319945,-71.079989],
radius=70,
fill_color="#b22222",
popup='Homicide',
color='red',
).add_to(map_1)
HeatMap(data=B2_district, radius=16).add_to(map_1)
map_1
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
df = pd.read_csv('C:/Users/jesy jeff laura.e/OneDrive/Desktop/CRIMEB-2.csv' ,encoding='latin-1')
df.head(5)
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\3190502872.py:1: DtypeWarning: Columns (0,2,3,4,5,6,7,10,12,13,16,17,19) have mixed types. Specify dtype option on import or set low_memory=False.
| INCIDENT_NUMBER | OFFENSE_CODE | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | 2403.0 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | NaN | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | 3201.0 | Property Lost | PROPERTY - LOST | D14 | 795 | NaN | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
| 2 | I182080052 | 2647.0 | Other | THREATS TO DO BODILY HARM | B2 | 329 | NaN | 03-10-2018 19.20 | 2018.0 | 10.0 | Wednesday | 19.0 | Part Two | DEVON ST | 42.308126 | -71.076930 | (42.30812619, -71.07692974) | 03-10-2018 | 24.0 | female |
| 3 | I182080051 | 413.0 | Aggravated Assault | ASSAULT - AGGRAVATED - BATTERY | A1 | 92 | NaN | 03-10-2018 20.00 | 2018.0 | 10.0 | Wednesday | 20.0 | Part One | CAMBRIDGE ST | 42.359454 | -71.059648 | (42.35945371, -71.05964817) | 03-10-2018 | 56.0 | female |
| 4 | I182080050 | 3122.0 | Aircraft | AIRCRAFT INCIDENTS | A7 | 36 | NaN | 03-10-2018 20.49 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Three | PRESCOTT ST | 42.375258 | -71.024663 | (42.37525782, -71.02466343) | 03-10-2018 | 57.0 | male |
df =df.drop('OFFENSE_CODE', axis=1)
df.head(2)
| INCIDENT_NUMBER | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | SHOOTING | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | NaN | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | Property Lost | PROPERTY - LOST | D14 | 795 | NaN | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
df =df.drop('SHOOTING', axis=1)
df.head(2)
| INCIDENT_NUMBER | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I182080058 | Disorderly Conduct | DISTURBING THE PEACE | E18 | 495 | 03-10-2018 20.13 | 2018.0 | 10.0 | Wednesday | 20.0 | Part Two | ARLINGTON ST | 42.262608 | -71.121186 | (42.26260773, -71.12118637) | 03-10-2018 | 23.0 | male |
| 1 | I182080053 | Property Lost | PROPERTY - LOST | D14 | 795 | 30-08-2018 20.00 | 2018.0 | 8.0 | Thursday | 20.0 | Part Three | ALLSTON ST | 42.352111 | -71.135311 | (42.35211146, -71.13531147) | 30-08-2018 | 18.0 | female |
df.shape
(525575, 18)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 525575 entries, 0 to 525574 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 INCIDENT_NUMBER 327820 non-null object 1 OFFENSE_CODE_GROUP 327820 non-null object 2 OFFENSE_DESCRIPTION 327820 non-null object 3 DISTRICT 326046 non-null object 4 REPORTING_AREA 327820 non-null object 5 OCCURRED_ON_DATE 327820 non-null object 6 YEAR 327820 non-null float64 7 MONTH 327820 non-null float64 8 DAY_OF_WEEK 327820 non-null object 9 HOUR 327820 non-null float64 10 UCR_PART 327727 non-null object 11 STREET 316843 non-null object 12 Lat 307188 non-null float64 13 Long 307188 non-null float64 14 Location 327820 non-null object 15 DATE 327820 non-null object 16 AGE 56911 non-null float64 17 Sex 891 non-null object dtypes: float64(6), object(12) memory usage: 72.2+ MB
df.isnull().sum()
INCIDENT_NUMBER 197755 OFFENSE_CODE_GROUP 197755 OFFENSE_DESCRIPTION 197755 DISTRICT 199529 REPORTING_AREA 197755 OCCURRED_ON_DATE 197755 YEAR 197755 MONTH 197755 DAY_OF_WEEK 197755 HOUR 197755 UCR_PART 197848 STREET 208732 Lat 218387 Long 218387 Location 197755 DATE 197755 AGE 468664 Sex 524684 dtype: int64
df=df.fillna(method='bfill')
df.isnull().sum()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1456028691.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
INCIDENT_NUMBER 197755 OFFENSE_CODE_GROUP 197755 OFFENSE_DESCRIPTION 197755 DISTRICT 197755 REPORTING_AREA 197755 OCCURRED_ON_DATE 197755 YEAR 197755 MONTH 197755 DAY_OF_WEEK 197755 HOUR 197755 UCR_PART 197755 STREET 197755 Lat 197755 Long 197755 Location 197755 DATE 197755 AGE 192578 Sex 524684 dtype: int64
df=df.fillna(method='ffill')
df.isnull().sum()
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\4175291397.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
INCIDENT_NUMBER 0 OFFENSE_CODE_GROUP 0 OFFENSE_DESCRIPTION 0 DISTRICT 0 REPORTING_AREA 0 OCCURRED_ON_DATE 0 YEAR 0 MONTH 0 DAY_OF_WEEK 0 HOUR 0 UCR_PART 0 STREET 0 Lat 0 Long 0 Location 0 DATE 0 AGE 0 Sex 0 dtype: int64
df.plot(kind='box')
plt.xticks(rotation=90)
plt.show()
#Removing outliers
#find iqr
q1=df['MONTH'].quantile(0.25)
q3=df['MONTH'].quantile(0.75)
print("Q1=",q1)
print("Q3=",q3)
iqr=q3-q1
print("iqr=",(q3-q1))
# calculate upperlimit and lowerlimit
lower=q1-1.5*iqr
upper=q3+1.5*iqr
print("Lower limits= ",lower)
print("upper limits= ",upper)
df[df['MONTH'] > upper]
df[df['MONTH'] < lower]
df=df[df['MONTH'] < upper]
df['MONTH'].describe()
Q1= 6.0 Q3= 8.0 iqr= 2.0 Lower limits= 3.0 upper limits= 11.0
count 478406.000000 mean 5.918555 std 2.133930 min 1.000000 25% 5.000000 50% 6.000000 75% 7.000000 max 10.000000 Name: MONTH, dtype: float64
df['MONTH'].plot(kind='box')
plt.show()
df.shape
(478406, 18)
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 478406 entries, 0 to 525574 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 INCIDENT_NUMBER 478406 non-null object 1 OFFENSE_CODE_GROUP 478406 non-null object 2 OFFENSE_DESCRIPTION 478406 non-null object 3 DISTRICT 478406 non-null object 4 REPORTING_AREA 478406 non-null object 5 OCCURRED_ON_DATE 478406 non-null object 6 YEAR 478406 non-null float64 7 MONTH 478406 non-null float64 8 DAY_OF_WEEK 478406 non-null object 9 HOUR 478406 non-null float64 10 UCR_PART 478406 non-null object 11 STREET 478406 non-null object 12 Lat 478406 non-null float64 13 Long 478406 non-null float64 14 Location 478406 non-null object 15 DATE 478406 non-null object 16 AGE 478406 non-null float64 17 Sex 478406 non-null object dtypes: float64(6), object(12) memory usage: 69.3+ MB
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['INCIDENT_NUMBER']=le.fit_transform(df['INCIDENT_NUMBER'])
df['OFFENSE_CODE_GROUP']=le.fit_transform(df['OFFENSE_CODE_GROUP'])
df['DAY_OF_WEEK']=le.fit_transform(df['DAY_OF_WEEK'])
df['OFFENSE_DESCRIPTION']=le.fit_transform(df['OFFENSE_DESCRIPTION'])
df['DISTRICT']=le.fit_transform(df['DISTRICT'])
df['REPORTING_AREA']=le.fit_transform(df['REPORTING_AREA'])
df['OCCURRED_ON_DATE']=le.fit_transform(df['OCCURRED_ON_DATE'])
df['UCR_PART']=le.fit_transform(df['UCR_PART'])
df['STREET']=le.fit_transform(df['STREET'])
df['Location']=le.fit_transform(df['Location'])
df['DATE']=le.fit_transform(df['DATE'])
df['Sex']=le.fit_transform(df['Sex'])
df
C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:10: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:13: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy C:\Users\jesy jeff laura.e\AppData\Local\Temp\ipykernel_10600\1699234579.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| INCIDENT_NUMBER | OFFENSE_CODE_GROUP | OFFENSE_DESCRIPTION | DISTRICT | REPORTING_AREA | OCCURRED_ON_DATE | YEAR | MONTH | DAY_OF_WEEK | HOUR | UCR_PART | STREET | Lat | Long | Location | DATE | AGE | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 248345 | 14 | 56 | 10 | 439 | 20428 | 2018.0 | 10.0 | 6 | 20.0 | 3 | 222 | 42.262608 | -71.121186 | 874 | 101 | 23.0 | 1 |
| 1 | 248344 | 52 | 175 | 7 | 770 | 198500 | 2018.0 | 8.0 | 4 | 20.0 | 2 | 129 | 42.352111 | -71.135311 | 14291 | 996 | 18.0 | 0 |
| 2 | 248343 | 46 | 209 | 3 | 256 | 20419 | 2018.0 | 10.0 | 6 | 19.0 | 3 | 1222 | 42.308126 | -71.076930 | 6815 | 101 | 24.0 | 0 |
| 3 | 248342 | 0 | 16 | 0 | 835 | 20427 | 2018.0 | 10.0 | 6 | 20.0 | 1 | 695 | 42.359454 | -71.059648 | 15495 | 101 | 56.0 | 0 |
| 4 | 248341 | 1 | 4 | 2 | 290 | 20430 | 2018.0 | 10.0 | 6 | 20.0 | 2 | 3297 | 42.375258 | -71.024663 | 16736 | 101 | 57.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 525570 | 0 | 66 | 226 | 8 | 817 | 142503 | 2015.0 | 6.0 | 1 | 0.0 | 2 | 4275 | 42.333839 | -71.080290 | 10948 | 718 | 47.0 | 1 |
| 525571 | 0 | 66 | 226 | 8 | 817 | 142503 | 2015.0 | 6.0 | 1 | 0.0 | 2 | 4275 | 42.333839 | -71.080290 | 10948 | 718 | 47.0 | 1 |
| 525572 | 0 | 66 | 226 | 8 | 817 | 142503 | 2015.0 | 6.0 | 1 | 0.0 | 2 | 4275 | 42.333839 | -71.080290 | 10948 | 718 | 47.0 | 1 |
| 525573 | 0 | 66 | 226 | 8 | 817 | 142503 | 2015.0 | 6.0 | 1 | 0.0 | 2 | 4275 | 42.333839 | -71.080290 | 10948 | 718 | 47.0 | 1 |
| 525574 | 0 | 66 | 226 | 8 | 817 | 142503 | 2015.0 | 6.0 | 1 | 0.0 | 2 | 4275 | 42.333839 | -71.080290 | 10948 | 718 | 47.0 | 1 |
478406 rows × 18 columns
df['REPORTING_AREA'].value_counts()
REPORTING_AREA
817 198266
0 18273
16 2076
98 1760
256 1682
...
663 8
103 4
715 2
133 1
864 1
Name: count, Length: 880, dtype: int64
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 478406 entries, 0 to 525574 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 INCIDENT_NUMBER 478406 non-null int32 1 OFFENSE_CODE_GROUP 478406 non-null int32 2 OFFENSE_DESCRIPTION 478406 non-null int32 3 DISTRICT 478406 non-null int32 4 REPORTING_AREA 478406 non-null int32 5 OCCURRED_ON_DATE 478406 non-null int32 6 YEAR 478406 non-null float64 7 MONTH 478406 non-null float64 8 DAY_OF_WEEK 478406 non-null int32 9 HOUR 478406 non-null float64 10 UCR_PART 478406 non-null int32 11 STREET 478406 non-null int32 12 Lat 478406 non-null float64 13 Long 478406 non-null float64 14 Location 478406 non-null int32 15 DATE 478406 non-null int32 16 AGE 478406 non-null float64 17 Sex 478406 non-null int32 dtypes: float64(6), int32(12) memory usage: 47.4 MB
# vertical split
x = df.drop('REPORTING_AREA', axis=1).values
x
array([[2.48345e+05, 1.40000e+01, 5.60000e+01, ..., 1.01000e+02,
2.30000e+01, 1.00000e+00],
[2.48344e+05, 5.20000e+01, 1.75000e+02, ..., 9.96000e+02,
1.80000e+01, 0.00000e+00],
[2.48343e+05, 4.60000e+01, 2.09000e+02, ..., 1.01000e+02,
2.40000e+01, 0.00000e+00],
...,
[0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
4.70000e+01, 1.00000e+00],
[0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
4.70000e+01, 1.00000e+00],
[0.00000e+00, 6.60000e+01, 2.26000e+02, ..., 7.18000e+02,
4.70000e+01, 1.00000e+00]])
x.shape
(478406, 17)
y = df['UCR_PART'].values.reshape(-1,1)
y
array([[3],
[2],
[3],
...,
[2],
[2],
[2]])
#horizontal split
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=0)
print("Shape of X_train: ",x_train.shape)
print("Shape of X_test: ", x_test.shape)
print("Shape of y_train: ",y_train.shape)
print("Shape of y_test",y_test.shape)
Shape of X_train: (334884, 17) Shape of X_test: (143522, 17) Shape of y_train: (334884, 1) Shape of y_test (143522, 1)
from sklearn.linear_model import LinearRegression
regressor_linear = LinearRegression()
regressor_linear.fit(x_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
import scipy.stats as stats
import statsmodels.formula.api as smf
#intercept
c = regressor_linear.intercept_
#coefficient
m = regressor_linear.coef_
print("Intercept = ",c)
print("Coefficent = ",m)
Intercept = [-2.18454588e-10] Coefficent = [[-1.68214167e-18 -4.44267510e-15 3.93141482e-16 5.93437273e-15 -1.91858755e-18 1.07905236e-13 1.34032166e-14 -6.75012205e-15 -3.39157268e-15 1.00000000e+00 -7.93802697e-19 2.48844355e-14 1.54933020e-14 1.85944950e-19 4.45645683e-16 -6.36305424e-17 -1.29505729e-15]]
print(x_test)
[[0.00000e+00 6.60000e+01 2.26000e+02 ... 7.18000e+02 4.70000e+01 1.00000e+00] [4.08720e+04 1.50000e+01 7.10000e+01 ... 1.71000e+02 3.70000e+01 1.00000e+00] [7.33510e+04 5.50000e+01 2.06000e+02 ... 7.19000e+02 2.10000e+01 1.00000e+00] ... [0.00000e+00 6.60000e+01 2.26000e+02 ... 7.18000e+02 4.70000e+01 1.00000e+00] [2.37344e+05 4.60000e+01 2.09000e+02 ... 6.61000e+02 2.30000e+01 1.00000e+00] [4.37880e+04 4.60000e+01 2.09000e+02 ... 6.38000e+02 3.70000e+01 1.00000e+00]]
print(y_test)
[[2] [3] [3] ... [2] [3] [3]]
y_pred=regressor_linear.predict(x_test)
y_pred
array([[2.],
[3.],
[3.],
...,
[2.],
[3.],
[3.]])
y_test=y_test.ravel()
y_pred=y_pred.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
| Actual | Predicted | |
|---|---|---|
| 0 | 2 | 2.0 |
| 1 | 3 | 3.0 |
| 2 | 3 | 3.0 |
| 3 | 2 | 2.0 |
| 4 | 2 | 2.0 |
| ... | ... | ... |
| 143517 | 3 | 3.0 |
| 143518 | 2 | 2.0 |
| 143519 | 2 | 2.0 |
| 143520 | 3 | 3.0 |
| 143521 | 3 | 3.0 |
143522 rows × 2 columns
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()
print("Training score: ",regressor_linear.score(x_train,y_train))
print("Testing score: ",regressor_linear.score(x_test,y_test))
Training score: 1.0 Testing score: 1.0
print("accuracy: ",regressor_linear.score(x,y)*100)
accuracy: 100.0
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error
# Predicting Cross Validation Score the Test set results
cv_linear = cross_val_score(estimator = regressor_linear, X = x_train, y = y_train, cv = 10)
# Predicting R2 Score the Train set results
y_pred_linear_train = regressor_linear.predict(x_train)
r2_score_linear_train = r2_score(y_train, y_pred_linear_train)
# Predicting R2 Score the Test set results
y_pred_linear_test = regressor_linear.predict(x_test)
r2_score_linear_test = r2_score(y_test, y_pred_linear_test)
# Predicting RMSE the Test set results
rmse_linear = (np.sqrt(mean_squared_error(y_test, y_pred_linear_test)))
print("CV: ", cv_linear.mean())
print('R2_score (train): ', r2_score_linear_train)
print('R2_score (test): ', r2_score_linear_test)
print("RMSE: ", rmse_linear)
CV: 1.0 R2_score (train): 1.0 R2_score (test): 1.0 RMSE: 2.950017704202002e-13
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(x_train, y_train)
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
LogisticRegression(random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LogisticRegression(random_state=0)
y_pred = classifier.predict(x_test)
print(y_pred)
[2 2 2 ... 2 2 2]
y
array([[3],
[2],
[3],
...,
[2],
[2],
[2]])
y_test=y_test.ravel()
y_pred=y_pred.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
| Actual | Predicted | |
|---|---|---|
| 0 | 2 | 2 |
| 1 | 3 | 2 |
| 2 | 3 | 2 |
| 3 | 2 | 2 |
| 4 | 2 | 2 |
| ... | ... | ... |
| 143517 | 3 | 2 |
| 143518 | 2 | 2 |
| 143519 | 2 | 2 |
| 143520 | 3 | 2 |
| 143521 | 3 | 2 |
143522 rows × 2 columns
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()
print("Training score: ",classifier.score(x_train,y_train))
print("Testing score: ",classifier.score(x_test,y_test))
Training score: 0.6862435947970044 Testing score: 0.685455888295871
print("accuracy: ",classifier.score(x,y)*100)
accuracy: 68.60072825173597
import warnings
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error
# Predicting Cross Validation Score the Test set results
cv_classifier = cross_val_score(estimator = classifier, X = x_train, y = y_train, cv = 10)
# Predicting R2 Score the Train set results
y_pred_classifier_train = classifier.predict(x_train)
r2_score_classifier_train = r2_score(y_train, y_pred_linear_train)
# Predicting R2 Score the Test set results
y_pred_classifier_test = classifier.predict(x_test)
r2_score_classifier_test = r2_score(y_test, y_pred_classifier_test)
# Predicting RMSE the Test set results
rmse_classifier = (np.sqrt(mean_squared_error(y_test, y_pred_classifier_test)))
print("CV: ", cv_classifier.mean())
print('R2_score (train): ', r2_score_classifier_train)
print('R2_score (test): ', r2_score_classifier_test)
print("RMSE: ", rmse_classifier)
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1183: DataConversionWarning:
A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
CV: 0.6859270689813117 R2_score (train): 1.0 R2_score (test): -0.23202899548422828 RMSE: 0.6061928699443855
C:\Users\jesy jeff laura.e\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\linear_model\_logistic.py:460: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(x_train, y_train)
Lasso(alpha=0.1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
Lasso(alpha=0.1)
y_pred = lasso.predict(x_test)
print(y_pred)
[2.01144291 2.65454914 2.72569111 ... 2.01144291 2.69855938 2.6412486 ]
y
array([[3],
[2],
[3],
...,
[2],
[2],
[2]])
y_test=y_test.ravel()
y_pred=y_pred.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred})
actual_pred
| Actual | Predicted | |
|---|---|---|
| 0 | 2 | 2.011443 |
| 1 | 3 | 2.654549 |
| 2 | 3 | 2.725691 |
| 3 | 2 | 2.011443 |
| 4 | 2 | 1.995636 |
| ... | ... | ... |
| 143517 | 3 | 2.646148 |
| 143518 | 2 | 2.030663 |
| 143519 | 2 | 2.011443 |
| 143520 | 3 | 2.698559 |
| 143521 | 3 | 2.641249 |
143522 rows × 2 columns
include=10
actual=actual_pred.head(include)
actual.plot(kind='bar')
plt.grid()
plt.show()
print("Training score: ",lasso.score(x_train,y_train))
print("Testing score: ",lasso.score(x_test,y_test))
Training score: 0.8847571362974157 Testing score: 0.8846696817277504
print("accuracy: ",lasso.score(x,y)*100)
accuracy: 88.47309296424297
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score
import numpy as np
from sklearn.metrics import mean_squared_error
# Predicting Cross Validation Score the Test set results
cv_lasso = cross_val_score(estimator = lasso, X = x_train, y = y_train, cv = 10)
# Predicting R2 Score the Train set results
y_pred_lasso_train = lasso.predict(x_train)
r2_score_lasso_train = r2_score(y_train, y_pred_linear_train)
# Predicting R2 Score the Test set results
y_pred_lasso_test = lasso.predict(x_test)
r2_score_lasso_test = r2_score(y_test, y_pred_lasso_test)
# Predicting RMSE the Test set results
rmse_lasso = (np.sqrt(mean_squared_error(y_test, y_pred_lasso_test)))
print("CV: ", cv_lasso.mean())
print('R2_score (train): ', r2_score_lasso_train)
print('R2_score (test): ', r2_score_lasso_test)
print("RMSE: ", rmse_lasso)
CV: 0.8847480531974219 R2_score (train): 1.0 R2_score (test): 0.8846696817277504 RMSE: 0.18546933066976745
# Fitting the Decision Tree Regression Model to the dataset
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state = 0)
dt.fit(x_train, y_train)
DecisionTreeRegressor(random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
DecisionTreeRegressor(random_state=0)
y_pred_dt_test = dt.predict(x_test)
y_pred_dt_test
array([2., 3., 3., ..., 2., 3., 3.])
y_test=y_test.ravel()
y_pred_dt_test =y_pred_dt_test.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_dt_test})
actual_pred
| Actual | Predicted | |
|---|---|---|
| 0 | 2 | 2.0 |
| 1 | 3 | 3.0 |
| 2 | 3 | 3.0 |
| 3 | 2 | 2.0 |
| 4 | 2 | 2.0 |
| ... | ... | ... |
| 143517 | 3 | 3.0 |
| 143518 | 2 | 2.0 |
| 143519 | 2 | 2.0 |
| 143520 | 3 | 3.0 |
| 143521 | 3 | 3.0 |
143522 rows × 2 columns
print("Training score: ",dt.score(x_train,y_train))
print("Testing score: ",dt.score(x_test,y_test))
Training score: 1.0 Testing score: 1.0
print("accuracy: ",dt.score(x,y)*100)
accuracy: 100.0
from sklearn.metrics import r2_score
# Predicting Cross Validation Score
cv_dt = cross_val_score(estimator = dt, X = x_train, y = y_train, cv = 10)
# Predicting R2 Score the Train set results
y_pred_dt_train = dt.predict(x_train)
r2_score_dt_train = r2_score(y_train, y_pred_dt_train)
# Predicting R2 Score the Test set results
y_pred_dt_test = dt.predict(x_test)
r2_score_dt_test = r2_score(y_test, y_pred_dt_test)
# Predicting RMSE the Test set results
rmse_dt = (np.sqrt(mean_squared_error(y_test, y_pred_dt_test)))
print('CV: ', cv_dt.mean())
print('R2_score (train): ', r2_score_dt_train)
print('R2_score (test): ', r2_score_dt_test)
print("RMSE: ", rmse_dt)
CV: 1.0 R2_score (train): 1.0 R2_score (test): 1.0 RMSE: 0.0
# Fitting the Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators = 500, random_state = 0)
rf.fit(x_train, y_train.ravel())
RandomForestRegressor(n_estimators=500, random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
RandomForestRegressor(n_estimators=500, random_state=0)
y_pred_rf_test = rf.predict(x_test)
y_pred_rf_test
array([2., 3., 3., ..., 2., 3., 3.])
y_test=y_test.ravel()
y_pred_dt_test =y_pred_rf_test.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_rf_test})
actual_pred
| Actual | Predicted | |
|---|---|---|
| 0 | 2 | 2.0 |
| 1 | 3 | 3.0 |
| 2 | 3 | 3.0 |
| 3 | 2 | 2.0 |
| 4 | 2 | 2.0 |
| ... | ... | ... |
| 143517 | 3 | 3.0 |
| 143518 | 2 | 2.0 |
| 143519 | 2 | 2.0 |
| 143520 | 3 | 3.0 |
| 143521 | 3 | 3.0 |
143522 rows × 2 columns
print("Training score: ",rf.score(x_train,y_train))
print("Testing score: ",rf.score(x_test,y_test))
Training score: 1.0 Testing score: 1.0
print("accuracy: ",rf.score(x,y)*100)
accuracy: 100.0
from sklearn.metrics import r2_score
# Predicting Cross Validation Score
cv_rf = cross_val_score(estimator = rf, X = x_train, y = y_train.ravel(), cv = 10)
# Predicting R2 Score the Train set results
y_pred_rf_train = rf.predict(x_train)
r2_score_rf_train = r2_score(y_train, y_pred_rf_train)
# Predicting R2 Score the Test set results
y_pred_rf_test = rf.predict(x_test)
r2_score_rf_test = r2_score(y_test, y_pred_rf_test)
# Predicting RMSE the Test set results
rmse_rf = (np.sqrt(mean_squared_error(y_test, y_pred_rf_test)))
print('CV: ', cv_rf.mean())
print('R2_score (train): ', r2_score_rf_train)
print('R2_score (test): ', r2_score_rf_test)
print("RMSE: ", rmse_rf)
# Importing the required libraries
from sklearn.svm import SVR
# Creating an instance of the model
svr=SVR()
# Fitting the model to the training data
svr.fit(x_train,y_train)
y_pred_svr_test = svr.predict(x_test)
y_pred_svr_test
y_test=y_test.ravel()
y_pred_svr_test =y_pred_svr_test.ravel()
actual_pred=pd.DataFrame({'Actual':y_test,'Predicted':y_pred_svr_test})
actual_pred
print("Training score: ",svr.score(x_train,y_train))
print("Testing score: ",svr.score(x_test,y_test))
print("accuracy: ",svr.score(x,y)*100)
from sklearn.metrics import r2_score
# Predicting Cross Validation Score
cv_svr = cross_val_score(estimator = svr, X = x_train, y = y_train.ravel(), cv = 10)
# Predicting R2 Score the Train set results
y_pred_svr_train = svr.predict(x_train)
r2_score_svr_train = r2_score(y_train, y_pred_svr_train)
# Predicting R2 Score the Test set results
y_pred_svr_test = svr.predict(x_test)
r2_score_svr_test = r2_score(y_test, y_pred_svr_test)
# Predicting RMSE the Test set results
rmse_svr = (np.sqrt(mean_squared_error(y_test, y_pred_svr_test)))
print('CV: ', cv_svr.mean())
print('R2_score (train): ', r2_score_svr_train)
print('R2_score (test): ', r2_score_svr_test)
print("RMSE: ", rmse_svr)
import tensorflow as tf
import numpy as np
# Define your model
model = tf.keras.models.Sequential([
tf.keras.layers.Dense(128, activation='relu'),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(1)
])
# Compile your model
model.compile(optimizer='adam', loss='mse')
# Make predictions on the test set
y_pred = model.predict(x_test)
# Make predictions on the test set
y_pred = model.predict(x_test)